Abstract. Increasingly, biologists are constructing evolutionary trees on large numbers of 
overlapping sets of taxa, and then combining them into a 'supertree' that classifies all the 
taxa. In this paper, we ask how much coverage of the total set of taxa is required by these 
subsets in order to ensure we have enough information to reconstruct the supertree uniquely. 
We describe two results - a combinatorial characterization of the covering subsets to ensure 
that at most one supertree can be constructed from the smaller trees (whatever trees these 
may be) and a more liberal analysis that asks only that the supertree is highly likely to be 
uniquely specified by the tree structure on the covering subsets. 



1. Introduction 

The scale of phylogenetic analysis has been growing steadily both in the number of taxa and 
the number of loci. Data from different loci are combined either directly into a single inference or 
indirectly by first building trees from each locus and combining trees as a "supertree". Regardless 
of approach, large-scale phylogenetic data sets derived from genome resources [3] or mining 
databases like GenBank [4, 6, 10] tend to exhibit a high proportion of missing entries (taxa 
missing from taxa sets for different loci or input trees) - 55% to 96% in the papers just cited. 
Wiens [14] has argued that the effect of missing data on the accuracy of tree inference is minimal 
as long as these missing data are randomly distributed and counterbalanced by enough data 
overall. 

However, the pattern of missing entries is highly nonrandom, especially in the data mining 
studies, as the pattern is determined by numerous sample biases in the databases (for 
examples, see the PhyLoTA Browser database [7]). Moreover, few analytic results are available to 
complement simulation-based studies of this problem. 

In this paper, we mathematically address the question of whether a given collection of subsets 
of taxa would suffice to reconstruct a tree uniquely for all the taxa, if we can infer a tree correctly 
on each of the subsets. Our study is also motivated by some recent mathematical work concerning 
supertree construction under various taxon coverage conditions [1, 5, 13]. 

1.1. Definitions. We begin by recalling some basic definitions from phylogenetic theory. 
Following [9], given a set X of taxa, a binary phylogenetic X-tree is a tree T in which the degree 1 
vertices (leaves of T) consist of the set X and all the remaining vertices of T are unlabelled and of 
degree 3. Fig. 1 shows two of the 15 distinct binary phylogenetic X-trees for X = {1, 2, 3, 4, 5}. 
For a binary phylogenetic tree T and a subset Y of X, let T|Y denote the induced binary 
phylogenetic tree on leaf set Y (the tree obtained from the minimal subtree connecting Y by suppressing 
any vertices of degree 2). A quartet tree is a binary phylogenetic tree on four leaves. For such 
a tree, with leaves a, b, c, d, we write ab|cd if the interior edge of the tree separates the pair a, b 
from c, d. 
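
As an illustration of the notation T|Y and ab|cd, the induced quartet tree can be read off a tree from path lengths alone. The following is a minimal Python sketch and not part of the original analysis: the edge-list encoding, the function names and the example tree (a hypothetical five-leaf caterpillar with cherries {1, 2} and {4, 5}, not necessarily either tree of Fig. 1) are assumptions made for illustration. It relies on the four-point condition: in a binary tree, ab|cd holds exactly when d(a,b) + d(c,d) is strictly smaller than the other two pairwise sums.

```python
from collections import deque

def build_adjacency(edges):
    """Turn a list of undirected edges into an adjacency dictionary."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def distances_from(adj, source):
    """Breadth-first search: number of edges from `source` to every vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def induced_quartet(adj, a, b, c, d):
    """Return T|{a,b,c,d} as an unordered pair of unordered leaf pairs."""
    dist = {leaf: distances_from(adj, leaf) for leaf in (a, b, c, d)}
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

    def path_sum(pairing):
        (p, q), (r, s) = pairing
        return dist[p][q] + dist[r][s]

    # Four-point condition: the pairing split by the interior edge
    # has the smallest sum of path lengths.
    first, second = min(pairings, key=path_sum)
    return frozenset({frozenset(first), frozenset(second)})

# Hypothetical example tree: internal vertices "u", "v", "w" are unlabelled
# in the phylogenetic sense; the names exist only to encode the edges.
caterpillar = build_adjacency([
    (1, "u"), (2, "u"), ("u", "v"), (3, "v"), ("v", "w"), (4, "w"), (5, "w"),
])
```

For this tree, `induced_quartet(caterpillar, 1, 2, 3, 4)` gives the pair of pairs corresponding to 12|34. The result is order-independent, so the leaves may be passed in any order.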
Let S be a collection of subsets of a set X, and let n = |X| throughout. We say that S is 
phylogenetically decisive if it satisfies the following property: If T and T' are binary phylogenetic 
X-trees, with T|Y = T'|Y for all Y ∈ S, then T = T'. In other words, for any binary phylogenetic 
X-tree T, the collection of induced subtrees {T|Y : Y ∈ S} uniquely determines T (up to 
isomorphism). 

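Let Q_S be the set of all quartets from X that lie in at least one set in S. That is: 

\[ Q_S = \bigcup_{Y \in S} \binom{Y}{4}, \]

where \binom{Y}{m} denotes the set of m-element subsets of Y. 
Note that S is phylogenetically decisive if and only if Q_S is phylogenetically decisive, since T|Y = 
T'|Y if and only if T|q = T'|q for all four-element subsets q of Y [12]. It is easily shown that if S is 
phylogenetically decisive then: 

\[ \bigcup_{Y \in S} \binom{Y}{3} = \binom{X}{3}. \]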

In other words, every three-taxon subset of X must be contained in some element Y of 
S (this is Lemma 6.2.1 of [5]). However, this necessary condition for phylogenetic decisiveness 
can be shown to be insufficient (an example is provided in [5]). One sufficient condition has been 
known since 1992 [12]: namely, if Q_S contains all quartets of the form {x_0, x, y, z} for some fixed 
x_0 ∈ X and all distinct x, y, z ∈ X - {x_0}, then S is phylogenetically decisive. However, this 
sufficient condition is not necessary, as the following example shows. 

1.2. Example. Let S = {{1, 2, 3, 4}, {1, 2, 3, 5}, {2, 3, 4, 5}, {1, 3, 4, 5}}. Then S is a 
phylogenetically decisive collection of subsets of X = {1, 2, 3, 4, 5}; that is, each of the 15 binary phylogenetic 
X-trees is determined by the collection T|Y for Y ∈ S. For example, for the two trees T, T' 
in Fig. 1, we have different induced quartet trees T|Y = 12|34, T'|Y = 13|24 by selecting the 
taxon set Y = {1, 2, 3, 4} from S. Notice that in this example, no element of X lies in every set 
in S. Theorem 2 below will allow us to easily verify that S is phylogenetically decisive. 
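
The set Q_S and the necessary three-taxon covering condition are both easy to compute for a small collection. The sketch below is illustrative only: the function names are hypothetical, and it checks just the necessary condition stated above (every three-element subset of X lies in some Y ∈ S), which does not by itself establish decisiveness.

```python
from itertools import combinations

def quartets_covered(S):
    """Q_S: every four-element subset of X contained in some set of S."""
    return {frozenset(q) for Y in S for q in combinations(sorted(Y), 4)}

def covers_all_triples(X, S):
    """Necessary condition: each three-element subset of X lies in some Y in S."""
    covered = {frozenset(t) for Y in S for t in combinations(sorted(Y), 3)}
    return all(frozenset(t) in covered for t in combinations(sorted(X), 3))

X = {1, 2, 3, 4, 5}
S_example = [{1, 2, 3, 4}, {1, 2, 3, 5}, {2, 3, 4, 5}, {1, 3, 4, 5}]  # Example 1.2
S_single = [{1, 2, 3, 4}]  # hypothetical collection that misses taxon 5

Q_example = quartets_covered(S_example)
```

Here `Q_example` contains four of the five possible quartets on X (only {1, 2, 4, 5} is absent), and `covers_all_triples(X, S_example)` holds, whereas `S_single` fails the test because no set contains the triple {1, 2, 5}.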




Figure 1. Two binary phylogenetic X-trees, T and T', for X = {1, 2, 3, 4, 5}. If we take 
Y = {1, 2, 3, 4} then T|Y = 12|34 and T'|Y = 13|24. 

2. Characterizing decisiveness 

Our first result provides a purely combinatorial characterization of phylogenetic decisiveness. 
We say that a collection S of subsets of X satisfies the four-way 
partition property (for X) if, for all partitions of X into four disjoint, nonempty sets A_1, A_2, A_3, A_4 (with 
A_1 ∪ A_2 ∪ A_3 ∪ A_4 = X), there exist a_i ∈ A_i for i = 1, 2, 3, 4 for which {a_1, a_2, a_3, a_4} ∈ Q_S. We 
begin with a useful lemma. Recall that a cherry of a tree is a pair of leaves that are adjacent to 
the same vertex. 

Lemma 1. Suppose that T is a binary phylogenetic X-tree, and a, b ∈ X. 
(i) If a, b forms a cherry of T, then for every pair of subsets C, D that partition X - {a, b}, 
and every c ∈ C, d ∈ D, we have T|{a, b, c, d} = ab|cd. 

(ii) Conversely, if a, b does not form a cherry of T, then there exists a pair of subsets C, D 
that partition X - {a, b}, such that, for every c ∈ C and every d ∈ D, T|{a, b, c, d} is different 
from ab|cd. 

Proof: Part (i) of the claim is clear. For part (ii), we show that if a, b is not a cherry of T 
then we can construct a partition of X - {a, b} that satisfies the property described. Consider 
the path in T connecting a and b. If a, b is not a cherry, this path has at least two trees hanging 
off it. Let C be the leaf set of the hanging tree that is closest to a, and let D = X - C - {a, b}. 
Then, regardless of which element c we select in C and which element d we select in D, we have 
T|{a, b, c, d} = ac|bd. □ 
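
Lemma 1 can be read as a test for cherries: when all quartet trees of T are known, a pair a, b is a cherry exactly when every quartet containing both resolves as ab|cd. The following sketch applies that test to a hand-entered list of quartet trees. The helper names and the example (the five quartet trees of a hypothetical caterpillar tree with cherries {1, 2} and {4, 5}) are assumptions for illustration, not taken from the paper.

```python
from itertools import combinations

def qt(a, b, c, d):
    """The quartet tree ab|cd as an unordered pair of unordered pairs."""
    return frozenset({frozenset({a, b}), frozenset({c, d})})

def cherries(X, quartet_trees):
    """Pairs {a, b} that every quartet tree containing a and b resolves as ab|cd."""
    found = []
    for a, b in combinations(sorted(X), 2):
        # Quartet trees whose leaf set contains both a and b.
        relevant = [t for t in quartet_trees if {a, b} <= frozenset().union(*t)]
        if relevant and all(frozenset({a, b}) in t for t in relevant):
            found.append((a, b))
    return found

# All five quartet trees of the hypothetical caterpillar ((1,2),3,(4,5)).
example_quartets = [
    qt(1, 2, 3, 4), qt(1, 2, 3, 5), qt(1, 2, 4, 5), qt(1, 3, 4, 5), qt(2, 3, 4, 5),
]
found_cherries = cherries({1, 2, 3, 4, 5}, example_quartets)
```

On this input the test returns the two cherries (1, 2) and (4, 5). Every other pair is rejected by at least one quartet, as in part (ii) of the lemma.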

Theorem 2. A collection S of subsets of X is phylogenetically decisive if and only if S satisfies 
the four-way partition property for X. 

Proof: We first show that the condition is necessary. Suppose that there is a four-way 
partition A_1, A_2, A_3, A_4 of X for which no quartet {a_1, a_2, a_3, a_4} with a_i ∈ A_i lies in Q_S. 
Let T be any binary phylogenetic X-tree obtained by taking arbitrary binary rooted phylogenetic 
trees on leaf sets A_1, A_2, A_3, A_4, and identifying the roots of these four trees with the leaves of a 
quartet tree. Let T' be one of the two trees obtained from T by performing a nearest neighbor 
interchange about the central edge of the quartet (to which the four rooted trees were attached). 
The only quartets from X that T' resolves differently from T are quartets that contain one leaf 
from each of the sets A_1, A_2, A_3, A_4, and we have assumed there is no such quartet in Q_S. Thus 
T' ≠ T, yet T|Y = T'|Y for all Y ∈ S, which shows that S is not phylogenetically decisive. 

We next show that the condition is sufficient. Suppose that T is any binary phylogenetic 
X-tree. We must show that no other binary tree displays the collection of quartet trees Q_T := 
{T|q : q ∈ Q_S}. We will use induction on |X|. The result clearly holds for |X| = 4. Now suppose 
T has n > 4 leaves and that T' is a phylogenetic tree that displays Q_T. We will show that 
T' = T. 

Lemma 1 allows us to use Q_S to identify when a pair a, b is a cherry of T. The argument 
is as follows. For each pair a, b, consider all choices of C, D that partition X - {a, b}. By our 
assumption concerning S (taking A_1 = {a}, A_2 = {b}, A_3 = C, A_4 = D), it follows that there 
exist c ∈ C, d ∈ D such that {a, b, c, d} ∈ Q_S, and so some resolution of a, b, c, d is in Q_T. If 
this resolution is different from ab|cd then we discard a, b as a candidate for being a cherry of 
any tree that displays Q_T, including T and T' (by Lemma 1(i)). On the other hand, if for every 
choice of C, D, we have the resolution ab|cd in Q_T then a, b must be a cherry of every tree that 
displays Q_T, including T and T' (by Lemma 1(ii)). 

Now let a, b be a cherry of T (and hence, by the argument above, of T'). Consider the set X' 
obtained from X by deleting b, and let S' be the collection of subsets 
of X' obtained from S by replacing each occurrence of b in Y ∈ S by a (if a, b appear together 
in some set Y, then we simply delete b from that set). We claim that if S satisfies the four-way 
partition property for X then S' satisfies this property for X'. Consider a partition A_1, A_2, A_3, A_4 
of X'. The element a lies in one of these sets; let us say A_1. Consider the four-way partition of X 
given by: A_1 ∪ {b}, A_2, A_3, A_4. Then, by assumption, there exist a_1 ∈ A_1 ∪ {b} and a_i ∈ A_i (i = 2, 3, 4) 
with {a_1, a_2, a_3, a_4} ∈ Q_S. Now, b is not one of a_2, a_3, a_4, and so, regardless of whether a_1 is a, 
or b, or neither, we obtain a quartet in Q_{S'} with one element in each A_i: namely {a_1, a_2, a_3, a_4} 
itself if a_1 ≠ b, and {a, a_2, a_3, a_4} if a_1 = b. (Note that if Y is a set in S of size 4 containing 
a, b then this set will not produce a quartet in Q_{S'}, since on deleting b, we obtain a set of size 3; 
however, this does not create a problem since we have at least three elements of X - {a, b} in 
{a_1, a_2, a_3, a_4}.) 

Let T_b = T|X' be the binary phylogenetic X'-tree obtained from T by deleting leaf b (and 
its incident edge), and let Q_{T_b} = {T_b|q : q ∈ Q_{S'}}. By induction (noting that |X'| < |X| and 
that S' satisfies the four-way partition property for X'), T_b is the unique tree that displays Q_{T_b}. 
However, the tree T'_b obtained from T' by deleting leaf b also displays Q_{T_b}, since T' displays Q_T 
and a, b is a cherry of T'. Thus T'_b = T_b and, since a, b is a cherry of both trees, T' = T as required. □ 
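
For small taxon sets, Theorem 2 gives a direct (if exponential-time) test of decisiveness: enumerate every four-way partition of X and look for a quartet of Q_S with one taxon in each block. The sketch below is a brute-force illustration under that reading. The function names are hypothetical, the enumeration visits all 4^n block labelings (so each unordered partition is seen 24 times), and it is intended only for small n.

```python
from itertools import combinations, product

def quartets_covered(S):
    """Q_S: every four-element subset of X contained in some set of S."""
    return {frozenset(q) for Y in S for q in combinations(sorted(Y), 4)}

def has_four_way_partition_property(X, S):
    """Check that every four-way partition of X has a transversal quartet in Q_S."""
    taxa = sorted(X)
    Q = quartets_covered(S)
    for labels in product(range(4), repeat=len(taxa)):
        if set(labels) != {0, 1, 2, 3}:
            continue  # some block is empty, so this is not a four-way partition
        block = dict(zip(taxa, labels))
        # A quartet is a transversal if its four taxa lie in four different blocks.
        if not any(len({block[x] for x in q}) == 4 for q in Q):
            return False
    return True

X = {1, 2, 3, 4, 5}
S_example = [{1, 2, 3, 4}, {1, 2, 3, 5}, {2, 3, 4, 5}, {1, 3, 4, 5}]  # Example 1.2
S_smaller = [{1, 2, 3, 4}, {1, 2, 3, 5}, {2, 3, 4, 5}]  # hypothetical: one set dropped
```

The collection of Example 1.2 passes the test. Dropping the set {1, 3, 4, 5} makes it fail, because the partition {2, 3}, {1}, {4}, {5} then has no transversal quartet in Q_S.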

2.1. Remarks. 

• The fact that if S is phylogenetically decisive then every three-element subset {a, b, c} of X must 
  be contained in a quartet within some set Y ∈ S follows immediately from Theorem 2 
  by taking A = {a}, B = {b}, C = {c} and D = X - {a, b, c}. 

• The argument in the proof suggests an algorithm for building a tree based on identifying 
a cherry and recursion. 

• The computational complexity of determining whether an arbitrary collection S of 
  subsets of X is phylogenetically decisive seems an interesting question, since the number 
  of four-way partitions of X is exponential in n. For practical applications, a simple but fast 
  measure for quantifying the degree of phylogenetic decisiveness of a set S would be to 
  generate uniformly at random a large number of four-way partitions of X and ask for 
  what proportion of the resulting four-way partitions (A_1, A_2, A_3, A_4) there exist a_i ∈ A_i 
  for i = 1, ..., 4 with {a_1, a_2, a_3, a_4} ∈ Q_S. If this proportion is strictly less than 1 then 
  S is not phylogenetically decisive, but it may still be of interest to know how 'close' to 
  phylogenetically decisive it is by this measure. 

• The concept of phylogenetic decisiveness is related to, but different from, the weaker 
  concept of a phylogenetic 'grove' from [1, 8]. 
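
The sampling measure suggested in the remarks above can be sketched as follows. This is one possible reading, not a prescribed implementation: the function name, the default sample size and the seed are assumptions. Four-way partitions are drawn uniformly by labelling each taxon with a random block number and rejecting labelings that leave a block empty, which works because every unordered partition corresponds to the same number (24) of labelings.

```python
import random
from itertools import combinations

def decisiveness_score(X, S, samples=2000, seed=0):
    """Estimate the proportion of four-way partitions of X that have a
    transversal quartet in Q_S (a score of 1 is necessary for decisiveness)."""
    taxa = sorted(X)
    if len(taxa) < 4:
        raise ValueError("need at least four taxa to form a four-way partition")
    rng = random.Random(seed)
    Q = {frozenset(q) for Y in S for q in combinations(sorted(Y), 4)}
    hits = 0
    accepted = 0
    while accepted < samples:
        labels = [rng.randrange(4) for _ in taxa]
        if len(set(labels)) < 4:
            continue  # rejection step: some block is empty
        accepted += 1
        block = dict(zip(taxa, labels))
        if any(len({block[x] for x in q}) == 4 for q in Q):
            hits += 1
    return hits / samples

X = {1, 2, 3, 4, 5}
score_example = decisiveness_score(X, [{1, 2, 3, 4}, {1, 2, 3, 5}, {2, 3, 4, 5}, {1, 3, 4, 5}])
score_single = decisiveness_score(X, [{1, 2, 3, 4}])  # hypothetical sparse collection
```

For the decisive collection of Example 1.2 the score is exactly 1, since every sampled partition succeeds. For the single set {1, 2, 3, 4}, only the partitions that pair taxon 5 with another taxon succeed (4 of the 10 four-way partitions of X), so the estimate should fall near 0.4. A score below 1 certifies non-decisiveness, while a score of 1 from sampling is only suggestive.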



3. Decisive sets for random trees 

The combinatorial condition for phylogenetic decisiveness is very strong, and in this section 
we describe a condition that reflects the fact that, although not all trees might be determined by 
how they resolve certain sets of taxa that cover X, nearly all trees will be. For a collection 
S = {Y_1, ..., Y_k}, we say that S is decisive for a tree T provided T is the only tree that displays 
T|Y_1, ..., T|Y_k. Thus if S is phylogenetically decisive then it is decisive for every tree T, but S 
can be decisive for a particular tree T without being phylogenetically decisive. For instance, for 
every binary phylogenetic tree T, there is a set S of just n - 3 
quartets that is decisive for T [12]. For example, in Fig. 1, the set {{1, 2, 3, 4}, {1, 3, 4, 5}} 
is decisive for T but not for T'. 

By a random tree, we mean a binary phylogenetic X-tree chosen uniformly at random from 
the set of (2n - 5)!! binary phylogenetic X-trees. Thus we can talk about the probability that a 
given S is decisive for a random tree (it is simply the proportion of binary phylogenetic X-trees 
for which S is decisive). Note that if S consists of two sets Y_1, Y_2 and each of these sets contains 
a taxon that is not in the intersection Y_1 ∩ Y_2 then S is not decisive, even if just one taxon is 
unique to Y_1 and one to Y_2 (provided the intersection contains at least two elements). However, the 
following result shows that S is very likely to be decisive for a random tree, even when several 
(but not too many) taxa lie outside the intersection. 

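Theorem 3. Let S = {Y_1, Y_2} where k = |Y_1 ∩ Y_2| and λ_1 = |Y_1 - Y_1 ∩ Y_2|, λ_2 = |Y_2 - Y_1 ∩ Y_2|. 
Let p := P[S is decisive for a random tree T]. Then: 

(i) \[ p \geq 1 - \frac{3\lambda_1\lambda_2}{2k-3}; \]

(ii) \[ p \leq \exp\left(-\frac{\lambda_1\lambda_2}{\lambda_1+\lambda_2+k-5/2}\right). \]

In particular, if λ_1 λ_2 = o(k) then p = 1 - o(1). 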

Proof: First observe that if λ_1 = 0 or λ_2 = 0 then p = 1, and both (i) and (ii) apply (as tight 
bounds), so we will henceforth assume that λ_1, λ_2 > 0. 
Let Y := Y_1 ∩ Y_2, and consider T|Y. If T is a random tree with leaf set X, then T|Y is a 
random tree with leaf set Y. Moreover, we can generate a binary phylogenetic X-tree uniformly 
at random by the following randomized leaf-attachment process [11]. Take a given ordering of 
the taxa (note that this ordering is not necessarily selected randomly). We construct a sequence 
of trees beginning with a tree consisting of the first two taxa in the ordering, connected by an 
edge, and ending with the tree T. The process of constructing the next tree in the sequence from 
the previous is as follows: Select one of the edges of the tree so far constructed uniformly at 
random, subdivide this edge and make the midpoint adjacent to a new leaf (via a new edge) that 
is labelled by the next taxon in the ordering that has not appeared in the tree so far constructed. 
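
The leaf-attachment process just described can be sketched in a few lines. The encoding below (trees as edge lists, internal vertices named by hypothetical tuples such as ("v", 0), trees compared through their sets of splits) is an assumption made for illustration. The enumeration helper runs through every possible sequence of edge choices for a fixed taxon ordering, which gives a way to check on small cases that distinct choice sequences produce distinct trees.

```python
import random
from itertools import product

def attach(edges, edge_index, new_leaf, new_node):
    """Subdivide edges[edge_index] with new_node and hang new_leaf off it."""
    u, v = edges[edge_index]
    rest = edges[:edge_index] + edges[edge_index + 1:]
    return rest + [(u, new_node), (new_node, v), (new_node, new_leaf)]

def random_tree(taxa, rng):
    """Grow a tree on `taxa` (in the given order) by random edge attachment."""
    edges = [(taxa[0], taxa[1])]
    for i, leaf in enumerate(taxa[2:]):
        edges = attach(edges, rng.randrange(len(edges)), leaf, ("v", i))
    return edges

def splits(edges, taxa):
    """Canonical form: for each edge, the leaf set on the side without taxa[0]."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    leaves = set(taxa)
    result = set()
    for u, v in edges:
        seen, stack = {u}, [u]  # vertices on u's side of the edge (u, v)
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen and not (x == u and y == v):
                    seen.add(y)
                    stack.append(y)
        side = frozenset(seen & leaves)
        result.add(side if taxa[0] not in side else frozenset(leaves - side))
    return frozenset(result)

def all_trees(taxa):
    """Split sets of the trees produced by every sequence of edge choices."""
    choices = [range(2 * j + 1) for j in range(len(taxa) - 2)]  # 1, 3, 5, ... edges
    out = []
    for seq in product(*choices):
        edges = [(taxa[0], taxa[1])]
        for i, (leaf, e) in enumerate(zip(taxa[2:], seq)):
            edges = attach(edges, e, leaf, ("v", i))
        out.append(splits(edges, taxa))
    return out
```

For five taxa there are 1 × 3 × 5 = 15 choice sequences, and `len(set(all_trees([1, 2, 3, 4, 5])))` confirms that they give 15 different trees, matching the count (2n - 5)!! and the uniformity of the process. A single random tree is obtained with, for example, `random_tree([1, 2, 3, 4, 5, 6], random.Random(1))`.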

With reference to Y, Y_1, Y_2, we will select an ordering where the taxa in Y come first, then those 
in Y_1 - Y and finally those in Y_2 - Y (any ordering satisfying this constraint is adequate) to 
obtain a random binary tree (with uniform probability) on leaf set X := Y_1 ∪ Y_2. Note that each 
element y of Y_1 - Y or Y_2 - Y has a unique nearest edge e of T|Y, which we will denote 
by e(y). Moreover, the condition for T to be the only tree that displays the induced trees T|Y_1 
and T|Y_2 is that the sets of edges E_1 := {e(y) : y ∈ Y_1 - Y} and E_2 := {e(y) : y ∈ Y_2 - Y} are 
disjoint subsets of the total set of edges of T|Y (by Theorem 1 of [5]). Conditional on T|Y and 
E_1, consider the probability p' = p'(T|Y, E_1) of the event that the leaf attachment of the leaves 
in Y_2 - Y results in E_2 being disjoint from E_1 (and, as noted, this event implies that T is the 
only tree that displays T|Y_1 and T|Y_2). We have: 

\[ p' = \frac{x(x+2)\cdots(x+2(\lambda_2-1))}{(\lambda+2\lambda_1)(\lambda+2\lambda_1+2)\cdots(\lambda+2\lambda_1+2(\lambda_2-1))}, \tag{1} \]

where λ = 2k - 3 is the number of edges of T|Y, and x = λ - |E_1| is the number of edges of T|Y that are not in E_1. 
Proof of (i): From Eqn. (1) we have: 

\[ p' \geq \left(\frac{x}{\lambda+2\lambda_1}\right)^{\lambda_2}. \tag{2} \]

Now, the smallest possible value of the right-hand side of (2) over all choices of T|Y, E_1 is 
realized when x takes its smallest possible value or, equivalently, when |E_1| takes its maximal 
possible value of λ_1 (i.e. e(y) is a different edge of T|Y for each y ∈ Y_1 - Y), in which case 
x = λ - λ_1. Substituting this into (2) gives: 

\[ p' \geq \left(\frac{\lambda-\lambda_1}{\lambda+2\lambda_1}\right)^{\lambda_2} = \left(1-\frac{3\lambda_1}{\lambda+2\lambda_1}\right)^{\lambda_2} \geq 1-\frac{3\lambda_1\lambda_2}{\lambda+2\lambda_1} \geq 1-\frac{3\lambda_1\lambda_2}{\lambda}. \]

This lower bound is conditional on the two random variables T|Y and E_1; however, it depends 
on these only via the quantities λ and λ_1, λ_2, which are fixed in advance, and so the bound 
applies also without conditioning. This completes the proof of (i). 
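Proof of (ii): From Eqn. (1), we have: 

\[ p' \leq \left(\frac{x+2\lambda_2-2}{\lambda+2\lambda_1+2\lambda_2-2}\right)^{\lambda_2}, \tag{3} \]

and since x ≤ λ - 1 < λ, we have: 

\[ p' \leq \left(1-\frac{2\lambda_1}{\lambda+2\lambda_1+2\lambda_2-2}\right)^{\lambda_2} \leq \exp\left(-\frac{\lambda_1\lambda_2}{\lambda_1+\lambda_2+\lambda/2-1}\right), \]

from which (ii) now follows for similar reasons to the conclusion of the proof of part (i). 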



□ 
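
The bounds of Theorem 3 can be compared numerically with the exact value of p for small parameters. The sketch below is an illustration under stated assumptions rather than part of the proof. It takes disjointness of E_1 and E_2 as the criterion for decisiveness, as in the argument above, and it computes the distribution of |E_1| by a simple recursion: after j leaves of Y_1 - Y have been attached, the tree has λ + 2j edges, and each edge of T|Y not yet in E_1 is still a single edge, so the next leaf adds a new edge to E_1 with probability (λ - |E_1|)/(λ + 2j). The function names are hypothetical.

```python
from fractions import Fraction
from math import exp

def p_prime(lam, lam1, lam2, x):
    """Eqn. (1): probability that all lam2 leaves of Y_2 - Y avoid E_1."""
    prob = Fraction(1)
    for j in range(lam2):
        prob *= Fraction(x + 2 * j, lam + 2 * lam1 + 2 * j)
    return prob

def distribution_of_E1_size(lam, lam1):
    """Exact distribution of |E_1| under the leaf-attachment process."""
    dist = {0: Fraction(1)}
    for j in range(lam1):          # j leaves of Y_1 - Y already attached
        total = lam + 2 * j        # number of edges in the current tree
        nxt = {}
        for d, pr in dist.items():
            fresh = Fraction(lam - d, total)  # chance of hitting a new edge of T|Y
            if fresh > 0:
                nxt[d + 1] = nxt.get(d + 1, 0) + pr * fresh
            if fresh < 1:
                nxt[d] = nxt.get(d, 0) + pr * (1 - fresh)
        dist = nxt
    return dist

def decisive_probability(k, lam1, lam2):
    """Exact p = P[E_1 and E_2 are disjoint], averaging Eqn. (1) over |E_1|."""
    lam = 2 * k - 3
    return sum(pr * p_prime(lam, lam1, lam2, lam - d)
               for d, pr in distribution_of_E1_size(lam, lam1).items())

def theorem3_bounds(k, lam1, lam2):
    """Lower bound (i) and upper bound (ii) of Theorem 3."""
    lower = 1 - Fraction(3 * lam1 * lam2, 2 * k - 3)
    upper = exp(-lam1 * lam2 / (lam1 + lam2 + k - 2.5))
    return lower, upper
```

For k = 4 and λ_1 = λ_2 = 1 the exact value is 4/7: the first extra leaf lands on one of the five edges of T|Y, and the second must then avoid three of the seven edges of the enlarged tree. Both bounds of Theorem 3 can be checked against `decisive_probability` over a grid of small parameters. The lower bound is informative only when λ_1 λ_2 is small relative to k.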
3.1. Remark. Using the theory of Pólya urn models, one could, in principle, obtain an exact 
but complex expression for p in Theorem 3. However, a more interesting problem would be to 
obtain bounds for the probability that S is decisive for a random tree, when S consists of more 
than two sets. 

References 

[1] Ané, C., O. Eulenstein, R. Piaggio-Talice, and M. J. Sanderson (2009). Groves of phylogenetic trees. 
Annals of Combinatorics. In press. 

[2] Bryant, D., Böcker, S., Dress, A. W. M., and Steel, M. (2000). Algorithmic aspects of tree amalgamation. 
Journal of Algorithms 37: 522-537. 

[3] Dunn, C. W., A. Hejnol, D. Q. Matus, K. Pang, W. E. Browne, S. A. Smith, E. Seaver, G. W. Rouse, M. 
Obst, G. D. Edgecombe, M. V. Sørensen, S. H. D. Haddock, A. Schmidt-Rhaesa, A. Okusu, R. M. Kristensen, 
W. C. Wheeler, M. Q. Martindale, and G. Giribet (2008). Broad phylogenomic sampling improves resolution 
of the animal tree of life. Nature 452: 745-749. 

[4] Goloboff, P. A., S. A. Catalano, J. M. Mirande, C. A. Szumik, J. S. Arias, M. Källersjö, and J. S. Farris 
(2009). Phylogenetic analysis of 73,060 taxa corroborates major eukaryotic groups. Cladistics 25: 211-230. 

[5] Humphries, P. J. (2008). Combinatorial aspects of leaf-labelled trees. PhD thesis. University of Canterbury, 
Christchurch, New Zealand. 

[6] McMahon, M. M., and M. J. Sanderson (2006). Phylogenetic supermatrix analysis of GenBank sequences 
from 2228 papilionoid legumes. Syst. Biol. 55: 818-836. 

[7] PhyLoTA Browser database: http://loco.biosci.arizona.edu/pb 

[8] Sanderson, M. J., C. Ané, O. Eulenstein, D. Fernández-Baca, J. Kim, M. M. McMahon, and R. 
Piaggio-Talice (2007). Fragmentation of large data sets in phylogenetic analysis. In Reconstructing Evolution: New 
Mathematical and Computational Advances (O. Gascuel and M. Steel, eds.). Oxford University Press, Oxford. 

[9] Semple, C. and Steel, M. (2003). Phylogenetics. Oxford University Press, Oxford. 

[10] Smith, S. A., J. M. Beaulieu, and M. J. Donoghue (2009). Mega-phylogeny approach for comparative biology: 
an alternative to supertree and supermatrix approaches. BMC Evolutionary Biology 9. 

[11] Steel, M. A. and Penny, D. (1993). Distributions of tree comparison metrics - some new results. Systematic 
Biology 42(2): 126-141. 

[12] Steel, M. (1992). The complexity of reconstructing trees from qualitative characters and subtrees. Journal of 
Classification 9: 91-116. 

[13] Steel, M. and Rodrigo, A. (2008). Maximum likelihood supertrees. Systematic Biology 57: 243-250. 

[14] Wiens, J. J. (2006). Missing data and the design of phylogenetic analyses. Journal of Biomedical Informatics 
39: 34-42. 

Mike Steel: Allan Wilson Centre for Molecular Ecology and Evolution, Department of 
Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand; Michael Sanderson: 
Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721 

E-mail address: m.steel@math.canterbury.ac.nz 



